K Nearest Neighbors Regression

Neighbor-based Predictions

We have existing observations

\[(x_1, y_1), \dots, (x_n, y_n)\]


Given a new observation \(x_{new}\), how do we predict \(y_{new}\)?

  1. Find the 5 values in \((x_1, ..., x_n)\) that are closest to \(x_{new}\)
  1. Look up the \(y_i\)’s corresponding to those five closest \(x_i\)’s.
  1. Predict \(\widehat{y}_{new}\) = the average of these 5 \(y_i\)’s.
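These steps can be sketched in a few lines of base R (the vectors below are made up for illustration, not from the insurance data):

```r
# Toy data: 8 observed (x, y) pairs (made up for illustration)
x <- c(18, 22, 25, 31, 40, 47, 55, 64)
y <- c(2000, 2500, 2700, 3800, 6000, 8500, 11000, 14000)

x_new <- 30

# Step 1: find the positions of the 5 observed x's closest to x_new
nearest <- order(abs(x - x_new))[1:5]

# Steps 2-3: average the corresponding y's
y_hat <- mean(y[nearest])
y_hat
# 3400
```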

K-Nearest-Neighbors

To perform K-Nearest-Neighbors, we choose the K closest observations to our target, and we average their response values.


The Big Questions

  • What is our definition of closest?

  • What number should we use for K?

Non-Parametric Methods

K-Nearest-Neighbors is a non-parametric method.

That means…

  • We aren’t assuming anything about how our observations are generated or distributed.

  • We don’t even assume that the sample of observations is random!

  • We don’t impose any structure on our function \(f\)

  • KNN: \(f(x_{new})\) = average of the \(y_i\)’s for the 5 closest observed \(x_i\)’s

A Multipronged Approach

Recall:

  • regression is when we are trying to predict a numeric response

  • classification is when we are trying to predict a categorical response


K Nearest Neighbors can be used for both!

(but we’ll focus on regression for now)

Example

Recall from Assignment 1:

# A tibble: 6 × 6
    age sex      bmi smoker region    charges
  <dbl> <chr>  <dbl> <chr>  <chr>       <dbl>
1    19 female  27.9 yes    southwest  16885.
2    33 male    22.7 no     northwest  21984.
3    32 male    28.9 no     northwest   3867.
4    31 female  25.7 no     southeast   3757.
5    60 female  25.8 no     northwest  28923.
6    25 male    26.2 no     northeast   2721.

Establish Our Model

knn_mod <- nearest_neighbor(neighbors = 5) %>%
  set_engine("kknn") %>%
  set_mode("regression")



New engine! The available engines for each model type are listed here: https://www.tidymodels.org/find/parsnip/

(You will have to install.packages("kknn") if you are on your home computer.)

The mode now matters a lot: “classification” would be possible too!

New model function nearest_neighbor(), which has a required neighbors argument specifying the number of neighbors to be used.

Fit Our Model

knn_fit_1 <- knn_mod %>%
  fit(charges ~ age, data = ins)

Inspect Our Model

knn_fit_1$fit %>% summary()

Call:
kknn::train.kknn(formula = charges ~ age, data = data, ks = min_rows(5,     data, 5))

Type of response variable: continuous
minimal mean absolute error: 8370.425
Minimal mean squared error: 128968111
Best kernel: optimal
Best k: 5




Choosing K

Check Your Intuition

  1. What happens if we use K = 1?

Not necessarily bad, but we could be thrown off by weird outlier observations!

  1. What happens if we use K = (number of observations)?

We predict the same value, the overall average of all the \(y_i\)’s, no matter what!

Try it!

Open Activity-KNN-r.qmd

Use cross validation to choose between a KNN model with 5 neighbors that uses only age versus one that uses both age and bmi.

How do these models compare to the least-squares regression approach from Tuesday?

How do these models compare to a KNN model with 10 neighbors?



Dummy variables

Dummy variables

Suppose we now want to include region in our KNN model.

knn_fit_2 <- knn_mod %>%
  fit(charges ~ age + region, data = ins)

We can’t calculate a distance between categories!

Instead, we make dummy variables:

  • southwest = 1 if southwest, 0 if not
  • northwest = 1 if northwest, 0 if not
  • … etc

Now these are (sort of) numeric variables.
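As a sketch of what this coding looks like, base R’s model.matrix() builds these dummy columns automatically (the region values here are made up for illustration):

```r
# Made-up region values for illustration
region <- factor(c("southwest", "northwest", "southeast", "northwest"))

# One column per region: 1 if the observation is in that region, 0 if not
# (the "- 1" keeps a column for every region instead of dropping a baseline)
dummies <- model.matrix(~ region - 1)
dummies
```

Each row has exactly one 1, marking that observation’s region; step_dummy() does this same kind of coding inside a recipe.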

Creating a Recipe

Instead of manually changing the whole dataset, we can “teach” our model workflow what it needs to do to the data.

ins_rec <- recipe(charges ~ age + region, data = ins) %>%
  step_dummy(region)

Workflows

Now, we can combine our recipe (data processing instructions) and our model choice into a workflow:

ins_wflow <- workflow() %>%
  add_recipe(ins_rec) %>%
  add_model(knn_mod)

ins_wflow
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: nearest_neighbor()

── Preprocessor ────────────────────────────────────────────────────────────────
1 Recipe Step

• step_dummy()

── Model ───────────────────────────────────────────────────────────────────────
K-Nearest Neighbor Model Specification (regression)

Main Arguments:
  neighbors = 5

Computational engine: kknn 

Fitting and Obtaining Statistics

ins_fit_region <- fit(ins_wflow, data = ins) 

ins_fit_region %>%
  pull_workflow_fit()
parsnip model object


Call:
kknn::train.kknn(formula = ..y ~ ., data = data, ks = min_rows(5,     data, 5))

Type of response variable: continuous
minimal mean absolute error: 7585.732
Minimal mean squared error: 120283530
Best kernel: optimal
Best k: 5

Compare with Previous Model

knn_fit_1$fit %>% summary()

Call:
kknn::train.kknn(formula = charges ~ age, data = data, ks = min_rows(5,     data, 5))

Type of response variable: continuous
minimal mean absolute error: 8370.425
Minimal mean squared error: 128968111
Best kernel: optimal
Best k: 5

Think about it:

We didn’t get much benefit from adding region. But region is related to the response variable! Why?



Standardizing

Range of Values – A Comparison

  • What is the largest and smallest value of a dummy variable?
  • What is the largest and smallest value of age?
summarize(ins, 
          max_age = max(age), 
          min_age = min(age)
          )
# A tibble: 1 × 2
  max_age min_age
    <dbl>   <dbl>
1      64      18

Standardizing

What is the distance between:

  • Person A: 20 years old, from the southwest
  • Person B: 20 years old, from the northeast

Remember how we coded the regions!

What is the distance between:

  • Person A: 20 years old, from the southwest
  • Person B: 23 years old, from the southwest
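We can check these two distances directly. This sketch assumes Euclidean distance and the dummy coding above (one 0/1 column per region):

```r
# Encode each person as (age, southwest, northwest, northeast, southeast)
person_A <- c(20, 1, 0, 0, 0)  # 20 years old, southwest
person_B <- c(20, 0, 0, 1, 0)  # 20 years old, northeast
person_C <- c(23, 1, 0, 0, 0)  # 23 years old, southwest

euclid <- function(u, v) sqrt(sum((u - v)^2))

euclid(person_A, person_B)  # sqrt(2), about 1.41: opposite corners of the country
euclid(person_A, person_C)  # 3: same region, only 3 years apart
```

A 3-year age gap counts as more than twice the distance of living on opposite sides of the country, which is why age dominates the neighbor search.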

Finding a Comparable Scale

Let’s put age on a scale that is comparable to the dummy variables.

How about: mean of 0, standard deviation of 1

Does this sound like anything you’ve heard of before?

This is called normalizing a variable.
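Concretely, normalizing means subtracting the mean and dividing by the standard deviation. A quick sketch, using the six ages from the sample rows shown earlier:

```r
age <- c(19, 33, 32, 31, 60, 25)  # ages from the six sample rows above

# Normalize: subtract the mean, divide by the standard deviation
age_norm <- (age - mean(age)) / sd(age)

mean(age_norm)  # 0 (up to floating-point rounding)
sd(age_norm)    # 1
```

This is exactly what step_normalize() will do for us inside a recipe.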

Normalizing

Add it to the workflow!

ins_rec <- recipe(charges ~ age + region, data = ins) %>%
  step_dummy(region) %>%
  step_normalize(age)

ins_wflow <- workflow() %>%
  add_recipe(ins_rec) %>%
  add_model(knn_mod)

Normalizing

Inspect the Workflow

ins_wflow
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: nearest_neighbor()

── Preprocessor ────────────────────────────────────────────────────────────────
2 Recipe Steps

• step_dummy()
• step_normalize()

── Model ───────────────────────────────────────────────────────────────────────
K-Nearest Neighbor Model Specification (regression)

Main Arguments:
  neighbors = 5

Computational engine: kknn 

Fitting and Obtaining Statistics

ins_fit_region_age <- fit(ins_wflow, ins) 

ins_fit_region_age %>% pull_workflow_fit()
parsnip model object


Call:
kknn::train.kknn(formula = ..y ~ ., data = data, ks = min_rows(5,     data, 5))

Type of response variable: continuous
minimal mean absolute error: 7508.632
Minimal mean squared error: 118115709
Best kernel: optimal
Best k: 5

Compare with Previous Model

ins_fit_region %>%
  pull_workflow_fit()
parsnip model object


Call:
kknn::train.kknn(formula = ..y ~ ., data = data, ks = min_rows(5,     data, 5))

Type of response variable: continuous
minimal mean absolute error: 7585.732
Minimal mean squared error: 120283530
Best kernel: optimal
Best k: 5

Try it!

Open Activity-KNN-r.qmd again.

  1. Make a KNN model with K = 5, using age, bmi, smoker, and sex.

  2. Compare the model with non-normalized variables to one with normalized variables. Which is better?




Tuning

Tuning

K is what is called a tuning parameter.

This is a feature of a model that we have to choose before fitting the model.

Ideally, we’d try many values of the tuning parameter and find the best one.

Automatic tuning

knn_mod <- nearest_neighbor(neighbors = tune()) %>%
  set_engine("kknn") %>%
  set_mode("regression")

k_grid <- grid_regular(neighbors(), levels = 5)

k_grid
# A tibble: 5 × 1
  neighbors
      <int>
1         1
2         3
3         5
4         7
5        10

Automatic tuning

knn_mod_tune <- nearest_neighbor(neighbors = tune()) %>%
  set_engine("kknn") %>%
  set_mode("regression")

k_grid <- grid_regular(neighbors(c(1, 50)), 
                       levels = 25)

k_grid
# A tibble: 25 × 1
   neighbors
       <int>
 1         1
 2         3
 3         5
 4         7
 5         9
 6        11
 7        13
 8        15
 9        17
10        19
# ℹ 15 more rows

Using Cross Validation to Tune

ins_rec <- recipe(charges ~ age + bmi + sex + smoker, data = ins) %>%
  step_dummy(all_nominal()) %>%
  step_normalize(all_numeric())

ins_wflow <- workflow() %>%
  add_recipe(ins_rec) %>%
  add_model(knn_mod_tune)


ins_cv <- vfold_cv(ins, v = 10)

knn_grid_search <- tune_grid(ins_wflow,
                             resamples = ins_cv,
                             grid = k_grid
                             )

Tuning

knn_grid_search
# Tuning results
# 10-fold cross-validation 
# A tibble: 10 × 4
   splits           id     .metrics          .notes          
   <list>           <chr>  <list>            <list>          
 1 <split [387/44]> Fold01 <tibble [50 × 5]> <tibble [0 × 3]>
 2 <split [388/43]> Fold02 <tibble [50 × 5]> <tibble [0 × 3]>
 3 <split [388/43]> Fold03 <tibble [50 × 5]> <tibble [0 × 3]>
 4 <split [388/43]> Fold04 <tibble [50 × 5]> <tibble [0 × 3]>
 5 <split [388/43]> Fold05 <tibble [50 × 5]> <tibble [0 × 3]>
 6 <split [388/43]> Fold06 <tibble [50 × 5]> <tibble [0 × 3]>
 7 <split [388/43]> Fold07 <tibble [50 × 5]> <tibble [0 × 3]>
 8 <split [388/43]> Fold08 <tibble [50 × 5]> <tibble [0 × 3]>
 9 <split [388/43]> Fold09 <tibble [50 × 5]> <tibble [0 × 3]>
10 <split [388/43]> Fold10 <tibble [50 × 5]> <tibble [0 × 3]>

Tuning

knn_grid_search %>% collect_metrics()
# A tibble: 50 × 7
   neighbors .metric .estimator  mean     n std_err .config              
       <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
 1         1 rmse    standard   0.519    10  0.0283 Preprocessor1_Model01
 2         1 rsq     standard   0.749    10  0.0253 Preprocessor1_Model01
 3         3 rmse    standard   0.437    10  0.0247 Preprocessor1_Model02
 4         3 rsq     standard   0.804    10  0.0233 Preprocessor1_Model02
 5         5 rmse    standard   0.409    10  0.0260 Preprocessor1_Model03
 6         5 rsq     standard   0.822    10  0.0257 Preprocessor1_Model03
 7         7 rmse    standard   0.396    10  0.0265 Preprocessor1_Model04
 8         7 rsq     standard   0.832    10  0.0260 Preprocessor1_Model04
 9         9 rmse    standard   0.388    10  0.0267 Preprocessor1_Model05
10         9 rsq     standard   0.838    10  0.0262 Preprocessor1_Model05
# ℹ 40 more rows

Tuning

Tuning

What if we had only looked at K from 1 to 10?

Tuning

What if we had only looked at K from 20 to 50?

Tuning

knn_grid_search %>% 
  collect_metrics() %>%
  filter(.metric == "rmse") %>%
  slice_min(mean)
# A tibble: 1 × 7
  neighbors .metric .estimator  mean     n std_err .config              
      <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
1        13 rmse    standard   0.381    10  0.0273 Preprocessor1_Model07

Try it!

Open Activity-KNN-r.qmd again.

  1. Decide on a best final KNN model to predict insurance charges.

  2. Plot the residuals from this model.